CSCN 8010 - Foundations of Machine Learning¶

Final Project - Music Genre Classification¶

By: Aneesh Ramesh (8914620) and Troy Mazerolle (8972394)¶

Introduction¶

For this project, we want to explore how we can classify genres of music. The challenge is to convert an audio file into a form of data that we know how to work with.

To do this, we will use what are called spectrograms. Spectrograms are a graphical representation of the frequency content of a signal as it evolves over time. It is a two-dimensional display where time is plotted on the x-axis, frequency on the y-axis, and the color or intensity represents the magnitude or amplitude of the signal's frequency components. Spectrograms are widely employed in fields like signal processing and acoustics to analyze and visualize audio signals, providing insights into the changing frequencies within a given time frame. This tool is crucial for tasks such as speech analysis, music processing, and various scientific and engineering applications where understanding the temporal and spectral characteristics of a signal is essential.

This report will provide a walkthrough of our analysis, from loading the data to our accuracy results. As usual, we start with loading in all our libraries.

In [ ]:
# Tensorflow Libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import image_dataset_from_directory

# Plotting and Model Evaulation Libraries
import matplotlib.pyplot as plt
import IPython.display as ipd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

# Utility Libraries
import numpy as np
import os

# Setting seed for reproducibility
keras.utils.set_random_seed(42) 

verbose = 0

Data Loading¶

We now load in the spectrograms from our folder, and split them into training, validation, and test sets. For the code used to create the spectrograms from the audio files, see Appendix A.

In [ ]:
training_images_filepath = "./Data/spec_original"
category_labels = os.listdir(training_images_filepath)

xdim = 180 
ydim = 180

spectograms = image_dataset_from_directory(
    training_images_filepath,
    image_size = (xdim, ydim),
    batch_size = 108)

## Use num_batches - 2 batches for training, 1 batch for validation, 1 batch for testing
num_batches = tf.data.experimental.cardinality(spectograms).numpy()
train = spectograms.take(num_batches - 2).cache()
remaining = spectograms.skip(num_batches - 2)
validation = remaining.take(1).cache()
test = remaining.skip(1).cache()
Found 1080 files belonging to 11 classes.

Before continuing, we can get a sense of what some of our spectrograms look like. Below we will output the first five spectrograms in the validation set and their corresponding label.

In [ ]:
for images, labels in validation:
    plt.figure(figsize=(15, 500))
    for i in range(5):
        plt.subplot(1, 5, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(f"Label: {category_labels[labels[i].numpy()]}")
        plt.axis("off")
    plt.show()

Model Fine-Tuning¶

We will now start building the model to classify the spectrograms. To do this, we will fine-tune the VGG16 model like we did in lab 10.

In [ ]:
conv_base = keras.applications.vgg16.VGG16(
    weights = "imagenet",
    include_top = False,
    input_shape = (xdim, ydim, 3))
conv_base.summary()
Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 180, 180, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 180, 180, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 180, 180, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 90, 90, 64)        0         
                                                                 
 block2_conv1 (Conv2D)       (None, 90, 90, 128)       73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 90, 90, 128)       147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 45, 45, 128)       0         
                                                                 
 block3_conv1 (Conv2D)       (None, 45, 45, 256)       295168    
                                                                 
 block3_conv2 (Conv2D)       (None, 45, 45, 256)       590080    
                                                                 
 block3_conv3 (Conv2D)       (None, 45, 45, 256)       590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, 22, 22, 256)       0         
                                                                 
 block4_conv1 (Conv2D)       (None, 22, 22, 512)       1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, 22, 22, 512)       2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, 22, 22, 512)       2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, 11, 11, 512)       0         
                                                                 
 block5_conv1 (Conv2D)       (None, 11, 11, 512)       2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 11, 11, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 11, 11, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 5, 5, 512)         0         
                                                                 
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________

For the first step of our fine-tuning, we will freeze the base model and add in our own head to train. In this part we will train a small variety of different heads and measure their results. We will then move forward with only the best model.

Simple Model¶

In [ ]:
conv_base.trainable = False

inputs = keras.Input(shape=(xdim, ydim, 3))
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(inputs)
x = layers.Flatten()(x)
x = layers.Dense(256, activation = "relu")(x)
outputs = layers.Dense(len(category_labels), activation="softmax")(x)
model_simp = keras.Model(inputs, outputs)

model_simp.compile(loss="sparse_categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

history_simp = model_simp.fit(
    train,
    epochs = 50,
    validation_data = validation,
    verbose = verbose)

plt.plot(history_simp.history["accuracy"])
plt.plot(history_simp.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);

The simple model by itself gets about a 90% accuracy.

We will now add some more layers and see if we can improve performance.

Medium Model¶

In [ ]:
conv_base.trainable = False

inputs = keras.Input(shape=(xdim, ydim, 3))
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(inputs)
x = layers.Flatten()(x)
x = layers.Dense(256, activation = "relu")(x)
x = layers.Dense(256, activation = "relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(len(category_labels), activation="softmax")(x)
model_med = keras.Model(inputs, outputs)

model_med.compile(loss="sparse_categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

history_med = model_med.fit(
    train,
    epochs = 50,
    validation_data = validation,
    verbose = verbose)

plt.plot(history_med.history["accuracy"])
plt.plot(history_med.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);

Surprisingly, the model with an extra dense layer and a dropout layer does not see an increase in performance. Between the simple model and the medium model, we should use the simple model because it has the same performance with less complexity.

For our final model, we will test a very complex network with five layers and a dropout layer.

Complex Model¶

In [ ]:
conv_base.trainable = False

inputs = keras.Input(shape=(xdim, ydim, 3))
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(inputs)
x = layers.Flatten()(x)
x = layers.Dense(256, activation = "relu")(x)
x = layers.Dense(128, activation = "relu")(x)
x = layers.Dense(64, activation = "relu")(x)
x = layers.Dense(32, activation = "relu")(x)
x = layers.Dense(16, activation = "relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(len(category_labels), activation="softmax")(x)
model_comp = keras.Model(inputs, outputs)

model_comp.compile(loss="sparse_categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

history_comp = model_comp.fit(
    train,
    epochs = 50,
    validation_data = validation,
    verbose = verbose)

plt.plot(history_comp.history["accuracy"])
plt.plot(history_comp.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);

Despite being more complex, this model does worse than the first two models. The complexity of this model is overfitting the data.

Moving forward, we will use the simple model as the head. Now that the head of our model is moderately trained, we can unfreeze the last four layers of the VGG16 model and fine-tune them.

In [ ]:
conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False

model_simp.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
              metrics=["accuracy"])

history = model_simp.fit(
    train,
    epochs = 10,
    validation_data = validation,
    verbose = verbose)

plt.plot(history.history["accuracy"])
plt.plot(history.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);

Fine-tuning in this step does not seem to improve the accuracy of the model further, but it is still an important step since it very well could have.

Model Evaluation¶

Now that we have our model, we can test its performance on the test set.

In [ ]:
predictions_prob = model_simp.predict(test)
predictions = np.argmax(predictions_prob, axis = 1)

ground_truth = [label for _, label in test.unbatch()]
ground_truth = tf.concat(ground_truth, axis = 0).numpy()  

accuracy = accuracy_score(ground_truth, predictions)
print("Accuracy of the model:", accuracy)
1/1 [==============================] - 3s 3s/step
Accuracy of the model: 0.9259259259259259

We get an accuracy of 92.6%, which is very strong. We can further analyze the performance by evaluating the confusion matrix.

In [ ]:
fig, ax = plt.subplots(figsize=(12,8))
conf_matrix = confusion_matrix(ground_truth, predictions)
ConfusionMatrixDisplay(conf_matrix, display_labels = category_labels).plot(ax = ax);

From the confusion matrix we can see that most of our misclassifications are one-off misclassifications. Most of these misclassifications make sense, such as metal being misclassified as rock or hiphop being misclassified as pop. One misclassification that stands out is a jpop audio track being misclassified as classical.

To get a better sense of our performance, we can look at the precision, recall, and F1-score of each class. We can write a function that will print out the results. This function can also identify the images that were misclassified and display them.

In [ ]:
def multiclass_accuracy_results(y_true, y_pred, test_images):
    labels = list(set(y_true))
    misclassified_indices = []

    for label in labels:
        new_vec_true = y_true == label
        new_vec_pred = y_pred == label
        misclassified_indices.extend(np.where((new_vec_true != new_vec_pred) & new_vec_true)[0])

        print(f'Class {label}:')
        print(f'Accuracy: {accuracy_score(new_vec_true, new_vec_pred)}')
        print(f'Precision: {precision_score(new_vec_true, new_vec_pred)}')
        print(f'Recall: {recall_score(new_vec_true, new_vec_pred)}')
        print(f'F1-Score: {f1_score(new_vec_true, new_vec_pred)}')
        print()

    misclassified_indices = np.unique(misclassified_indices)
    # Display misclassified images side by side
    plt.figure(figsize=(15, 3))
    for i, idx in enumerate(misclassified_indices):
        plt.subplot(1, len(misclassified_indices), i + 1)
        plt.imshow(test_images[idx].numpy().astype("uint8"))
        plt.title(f'True: {y_true[idx]}, Predicted: {y_pred[idx]}')
        plt.axis('off')
    plt.show()

    return None

test_images = [image for image, label in test.unbatch()]

multiclass_accuracy_results(ground_truth, predictions, test_images)
Class 0:
Accuracy: 0.9814814814814815
Precision: 0.8571428571428571
Recall: 0.8571428571428571
F1-Score: 0.8571428571428571

Class 1:
Accuracy: 0.9907407407407407
Precision: 0.8333333333333334
Recall: 1.0
F1-Score: 0.9090909090909091

Class 2:
Accuracy: 0.9907407407407407
Precision: 0.9090909090909091
Recall: 1.0
F1-Score: 0.9523809523809523

Class 3:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0

Class 4:
Accuracy: 0.9814814814814815
Precision: 0.9333333333333333
Recall: 0.9333333333333333
F1-Score: 0.9333333333333333

Class 5:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0

Class 6:
Accuracy: 0.9907407407407407
Precision: 1.0
Recall: 0.9230769230769231
F1-Score: 0.9600000000000001

Class 7:
Accuracy: 0.9814814814814815
Precision: 0.9230769230769231
Recall: 0.9230769230769231
F1-Score: 0.9230769230769231

Class 8:
Accuracy: 0.9907407407407407
Precision: 0.9090909090909091
Recall: 1.0
F1-Score: 0.9523809523809523

Class 9:
Accuracy: 0.9814814814814815
Precision: 1.0
Recall: 0.8823529411764706
F1-Score: 0.9375

Class 10:
Accuracy: 0.9629629629629629
Precision: 0.8
Recall: 0.8
F1-Score: 0.8000000000000002

The jpop image that was misclassified as classical is True: 6, Predicted: 1. We can play the jpop audio clip to see what it sounds like.

In [ ]:
ipd.Audio('./Data/jpop00009.wav')
Out[ ]:
Your browser does not support the audio element.

Note that the image is a random 30 second clip of this song. With that said, it is completely understandible that any part of this song (especially the intro) was misclassified as classical.

Conclusion¶

In conclusion, our model demonstrated robust performance, underscoring the effectiveness of employing image classification techniques for the analysis of spectrographs in the context of audio file genre classification. This success implies broader applications beyond music genres, suggesting the viability of adapting similar approaches to diverse audio-related tasks. Beyond music, this methodology can be applied to other forms of audio classification challenges, such as voice recognition, distinguishing animal vocalizations, or diagnosing automotive issues based on the distinctive sounds they produce. Basically, any audio classification task can be transformed into an image classification task through the use of spectrographs. The versatility of image-based audio analysis showcases the potential for leveraging this technology across various domains.

Sources¶

Audio Data: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification

VGG16: https://arxiv.org/abs/1409.1556

Apendix A: Generating Spectograms from Audio Files¶

In [ ]:
if False:
    def generate_spectrogram(input_path, output_path, extract_clips=False):
        try:
            if os.path.exists(output_path):
                print(f"Spectrogram already exists for {input_path}. Skipping.")
                return
                
            y, sr = librosa.load(input_path)

            if len(y) > 90 * sr and extract_clips:
                for i in range(3):
                    start_time = i * 30
                    end_time = start_time + 30
                    y_clip = y[start_time * sr:end_time * sr]

                    D = librosa.amplitude_to_db(np.abs(librosa.stft(y_clip)), ref=np.max)

                    plt.figure(figsize=(432/80, 288/80)) 
                    librosa.display.specshow(D, sr=sr, x_axis=None, y_axis=None) 
                    plt.axis('off')

                    clip_output_path = os.path.join(
                        os.path.dirname(output_path), f"{os.path.splitext(os.path.basename(output_path))[0]}_clip{i+1}.png"
                    )
                    # print(f'clip output path : {clip_output_path}')
                    plt.savefig(clip_output_path, bbox_inches='tight', pad_inches=0, dpi=100)
                    plt.close()

            else:
                D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

                plt.figure(figsize=(432/80, 288/80)) 
                librosa.display.specshow(D, sr=sr, x_axis=None, y_axis=None) 
                plt.axis('off')

                plt.savefig(output_path, bbox_inches='tight', pad_inches=0, dpi=100)
                plt.close()

        except LibrosaError as e:
            print(f"Error processing {input_path}: {e}")
    def process_folder(input_folder, output_folder, batch_size=50, extract_clips=False):
        for root, dirs, files in os.walk(input_folder):
            relative_path = os.path.relpath(root, input_folder)
            output_subfolder = os.path.join(output_folder, relative_path)
            for i in range(0, len(files), batch_size):
                batch_files = files[i:i + batch_size]
                for file in batch_files:
                    if file.endswith(".wav") and not file.startswith("._"):
                        input_path = os.path.join(root, file)

                        os.makedirs(output_subfolder, exist_ok=True)

                        output_path = os.path.join(output_subfolder, f"{os.path.splitext(file)[0]}.png")
                        generate_spectrogram(input_path, output_path, extract_clips)
                        print(f"Spectrogram generated for {file}")

    genres_original = os.path.join(os.getcwd(), os.path.join("Data", "genres_original"))
    spec_original = os.path.join(os.getcwd(), os.path.join("Data", "spec_original"))
    os.makedirs(spec_original, exist_ok=True)

    process_folder(genres_original, spec_original, batch_size=50, extract_clips=True)

    def rename_files_with_prefix(directory_path, prefix='jpop'):
        try:
            files = [f for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]

            for i, file_name in enumerate(files, start=1):
                new_file_name = f'{prefix}{i:05d}.png'
                
                old_path = os.path.join(directory_path, file_name)
                new_path = os.path.join(directory_path, new_file_name)

                # Rename the file
                os.rename(old_path, new_path)
                print(f"Renamed '{file_name}' to '{new_file_name}'")

        except FileNotFoundError:
            print(f"Directory '{directory_path}' not found.")
        except Exception as e:
            print(f"An error occurred: {e}")

    directory_to_rename = os.path.join(spec_original, "jpop")
    rename_files_with_prefix(directory_to_rename)